Our world is rapidly urbanizing in practically all geographies. In 2018, the UN reported that 55% of the global population lived in urban areas and projected that number to rise to 68% by 2050. With this urbanization, access to green space and parkland is certain to decrease across geographies and across populations. With this general decline in access to nature, there is likely to be an increase in the inequity of access, with the wealthy and privileged losing access at a lesser rate than the poor and underprivileged. Previous work has assessed the public health impacts of this inequity in access to nature in the context of COVID (Spotswood et al., 2021), and work on the “luxury effect” shows that wealthy neighborhoods can support higher biodiversity in some cases, due largely to the increased presence of park land (Magel et al., 2021). In this analysis, I seek to understand the relationship between access to green space and exposure to potentially harmful pollutants. I will also examine the relationship between green space and income. The analysis focuses on the Bay Area of California, a collection of nine counties surrounding San Francisco Bay on the Central Coast of California, USA.
Intuitively, it makes sense to hypothesize that, in an urban setting like the Bay Area, there would be a relationship between exposure to pollution and contamination and the relative urbanization of a location. We see natural landscapes as clean and urban hardscapes as dirty, and can make assumptions as to which location is healthier to live next to. Below I quantify that relationship using the spatial distribution of parks in the San Francisco Bay Area, summary data from the CalEnviroScreen model, and Census Block and Census BlockGroup geometries.
In developing this analysis, four key datasets are leveraged: 1) US Census American Community Survey (ACS) Income Dataset, 2) US Census TIGER Geometries, 3) CalEnviroScreen, 4) Trust for Public Land ParkServe Parks Dataset.
Table B19001 from the US Census Bureau (data.census.gov) is used to determine income levels in each Census Block. Income levels in this table are presented as discrete counts of the population within certain income bands. To account for this arrangement of the data, we define our variable of “income” for later regressions as the percent of respondents to the survey that make more than $100k annually (USD). The threshold of $100k is chosen because a) it is a clean break that is well-defined by the natural arrangement of the income categories and b) the mean income of the Bay Area is roughly $100k.
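The income variable described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the analysis. The column codes are assumptions that follow the Census API naming for B19001 (`_001E` as the total and `_014E` through `_017E` as the four bands at or above $100k) and should be checked against the downloaded table. The example row is made up.

```python
# Assumed B19001 column codes (Census API naming); verify against the
# downloaded table before use.
TOTAL = "B19001_001E"            # total count across all income bands
OVER_100K = [
    "B19001_014E",               # $100,000 to $124,999
    "B19001_015E",               # $125,000 to $149,999
    "B19001_016E",               # $150,000 to $199,999
    "B19001_017E",               # $200,000 or more
]


def pct_over_100k(row):
    """Percent of the total count that falls in bands at or above $100k.

    Returns None when the total is zero, so that empty geometries can be
    dropped before the regression instead of dividing by zero.
    """
    total = row[TOTAL]
    if total == 0:
        return None
    return 100.0 * sum(row[col] for col in OVER_100K) / total


# Made-up counts for a single geometry, for illustration only.
example = {
    "B19001_001E": 200,
    "B19001_014E": 20,
    "B19001_015E": 10,
    "B19001_016E": 15,
    "B19001_017E": 5,
}
print(pct_over_100k(example))  # 25.0
```

Returning None for an empty geometry is a design choice of this sketch; the original analysis does not state how such cases were handled.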
Originally, this analysis was intended to be performed with building footprints provided by the city of San Francisco. However, for practical purposes, census geometries were chosen instead. First, the building footprints dataset is extremely large and detailed, and very long processing times presented a limitation early in the development of this analysis, prompting the switch to a dataset with lower resolution and greater spatial extent. Second, the building footprints included residences, commercial structures, government structures, and retail structures. Given the urban setting, many (possibly most) of the structures that were residential were multi-family, which would bias the analysis against people living in apartments, duplexes, etc. Third, because CalEnviroScreen and the ACS income data are presented at the Census BlockGroup and Block levels, respectively, the building footprints would need to be rolled up to these coarser spatial units anyway. Fourth, because of the roll-up requirement and the limited extent of the building footprints being used (San Francisco), rolling up to Census geometries would diminish the sample size for later regressions.
Therefore, the decision was made to focus on Census Blocks (for the ACS Income analysis) and Census BlockGroups (for the CalEnviroScreen analysis) and to expand the analysis to the entire Bay Area of California. This allows for a more generalizable analysis and a more meaningful sample set. When developed, the Blocks dataset was filtered to remove all Blocks with a land area value of 0.0, which eliminates the Blocks in San Francisco that are entirely water. Following this filter, the centroid was calculated for each Block, and this centroid was used to calculate the distance between a Block and the nearest park boundary. For the BlockGroups, the distances were averaged across the Blocks contained, and this average was assigned as the distance for that BlockGroup.
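The filter, distance, and averaging steps described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the code used in the analysis. Coordinates are assumed to be in a projected system with linear units. Each park is simplified to a single ring of vertices. The field names `geoid`, `aland`, and `centroid` are hypothetical. The sketch relies on the Census convention that the first 12 digits of a 15-digit Block GEOID identify its BlockGroup. The geometries and GEOIDs in the example are made up.

```python
import math
from collections import defaultdict


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_boundary(p, ring):
    """Distance from p to the boundary of a ring (implicitly closed)."""
    n = len(ring)
    return min(
        point_segment_distance(p, ring[i], ring[(i + 1) % n]) for i in range(n)
    )


def block_group_mean_distances(blocks, parks):
    """Mean nearest-park-boundary distance per BlockGroup.

    Blocks with zero land area are skipped, mirroring the water filter.
    """
    totals = defaultdict(lambda: [0.0, 0])  # BlockGroup GEOID -> [sum, count]
    for blk in blocks:
        if blk["aland"] == 0:
            continue
        d = min(distance_to_boundary(blk["centroid"], ring) for ring in parks)
        bg = blk["geoid"][:12]  # first 12 digits identify the BlockGroup
        totals[bg][0] += d
        totals[bg][1] += 1
    return {bg: s / n for bg, (s, n) in totals.items()}


# Made-up example: one square park and four Blocks in two BlockGroups.
parks = [[(0, 0), (10, 0), (10, 10), (0, 10)]]
blocks = [
    {"geoid": "060750101001000", "aland": 500, "centroid": (15, 5)},
    {"geoid": "060750101001001", "aland": 800, "centroid": (20, 5)},
    {"geoid": "060750101001002", "aland": 0, "centroid": (50, 50)},  # water
    {"geoid": "060750102002000", "aland": 300, "centroid": (5, -3)},
]
means = block_group_mean_distances(blocks, parks)
```

In practice a spatial library with an index would replace the brute-force minimum over every park, which is likely too slow for all Blocks in nine counties. Note also that a centroid falling inside a park still gets a positive distance here, because the distance is measured to the boundary.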
CalEnviroScreen 4.0 is used as a simplified index of general environmental contaminant exposure. While the index incorporates a variety of stressors and pollutants, the summary index was chosen as the most appropriate indicator, given the wide variety of communities and potential exposures throughout the sample area. The CalEnviroScreen 4.0 data is presented at the Census BlockGroup level, and therefore the summary value is compared against the mean distance to a park from the Block centroids contained within each BlockGroup.
Park boundaries were downloaded and incorporated as shapefiles provided by the Trust for Public Land ParkServe program. No modifications were made to the park boundaries dataset (with the exception of re-projection), and all data were retained for the nine counties encompassing the Bay Area. Importantly, data were not included for the counties bordering the Bay Area. It is therefore possible that some Blocks on the outer boundary of Bay Area counties were assigned a “nearest park” that is farther away than their actual nearest park, if that park lies in a county that was not included. This is assumed to affect a negligible portion of the analysis, given the sample size.
The results of the regression analyses are presented below in Figures 2-5 and in the accompanying model results summaries. In neither regression is a significant relationship seen between distance to the nearest park boundary and the response variable, whether CalEnviroScreen summary score or percentage of ACS respondents with an income over $100k annually. In both cases, a linear model was fit to the data, and it appears to be the most appropriate form based on visual analysis of the scatterplots and of the density distributions of the residuals, which were generally normal and centered roughly about 0. However, in both instances, R-squared values are near zero and p-values are well above 0.05, indicating a lack of both predictive power and statistical significance.
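For readers who want to see how the reported quantities are computed, the following is a minimal sketch of a simple linear fit in plain Python. It is not the code used to produce the model summaries, and the data are made up to show a case with no linear relationship. The function name `ols_fit` is hypothetical. It returns the slope, intercept, R-squared, and the t statistic for the slope. Converting that t statistic to a p-value requires the Student's t distribution with n - 2 degrees of freedom, which the standard library does not provide, so that step is left to a statistics package.

```python
import math


def ols_fit(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot

    # Standard error of the slope and its t statistic (n - 2 degrees of freedom).
    se_slope = math.sqrt(ss_res / (n - 2) / sxx)
    t_stat = slope / se_slope if se_slope > 0 else math.inf

    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
        "t_stat": t_stat,
        "residuals": residuals,
    }


# Made-up data with no linear trend: distance (x) against a response (y).
distance = [0.0, 1.0, 2.0, 3.0]
response = [1.0, -1.0, -1.0, 1.0]
fit = ols_fit(distance, response)
```

With these made-up values the fitted slope and R-squared are both zero, which is the pattern a flat scatterplot produces: the fitted line explains none of the variance in the response.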
I reason that this lack of predictive capacity can be explained by a visual analysis of Figure 1, which maps all park land in the Bay Area. As this map shows, the Bay Area has an incredible density of parks, some small and typical of urban settings, and others extremely large and unique to the geography and topography of the area. Because parks are so ubiquitous throughout the nine counties analyzed, linear distance to the nearest park can reasonably be expected to be relatively uniform across the entire region.